Huffman code efficiencies for extensions of sources
Abstract
It is well known that the efficiency of noiseless coding may be improved by coding extensions of the source; for large extensions the efficiency is arbitrarily close to unity. This paper shows that the efficiency is not always improved just by coding the next extension. In some cases the code of a larger extension is markedly less efficient than its predecessor, although always within the theoretical limits of efficiency. We show how the phenomenon arises from changes to the Huffman coding tree as the source probabilities change and investigate it for binary and ternary codes.

Introduction

For a discrete memoryless information source S described by a source alphabet of symbols occurring with probabilities {P1, P2, P3, ...}, the entropy per source symbol is H(S) = –Σ Pi·log(Pi). When coding symbols from S for transmission over a noiseless channel it is usual to employ a variable-length code, choosing shorter codewords for the more probable symbols. Given that each codeword has probability Pi and length Li, the expected length of the resultant code is L = Σ Pi·Li. A fundamental result of information theory is that H(S) ≤ L. Alternatively, if we define the code efficiency to be η = H(S)/L, then η ≤ 1. A second important result, known as Shannon's first theorem, or the "noiseless coding theorem" (Abramson [1], p. 72), is that we can improve the coding efficiency by coding extensions of the source, i.e. grouping source symbols in groups of 2, 3 or more symbols and encoding the composite symbols. For the nth extension (coding n source symbols at a time) the efficiency is bounded by 1 ≥ η > 1 – 1/n. Thus, by encoding a sufficiently large extension, the efficiency can be forced arbitrarily close to unity. The usual code in this situation is the Huffman code [4].

Given that the source entropy is H and the average codeword length is L, we can characterise the quality of a code either by its efficiency (η = H/L, as above) or by its redundancy, R = L – H. Clearly, we have η = H/(H + R). Gallager [3] shows that the upper bound on the redundancy is P1 + 0.0861, where P1 is the probability of the most frequent symbol. Johnsen [5] and Capocelli et al. [2] derive progressively tighter bounds on the redundancy, developing relations for different ranges of P1.

The above authors study only the "base" Huffman code, although a specified extension can be treated much as a simple code in its own right. What does not seem to have been examined as extensively is the behaviour as we code successively higher extensions of a given source alphabet. A naive application of the above bound (1 ≥ η > 1 – 1/n) suggests that the efficiency should always improve as higher-order extensions are coded, and the efficiency is generally assumed to approach the ideal quite quickly as source extensions are encoded (Abramson [1], p. 87). Thus, if we consider the binary source P = {0.85, 0.15}, encoded with a binary Huffman code, we find that the efficiencies of the first few extensions are as shown in Table 1. The approach to the ideal is obviously quite rapid.

Extension    Efficiency
    1          0.6098
    2          0.8544
    3          0.9663
    4          0.9837
    5          0.9926

Table 1. Efficiencies for extensions of the source {0.85, 0.15}
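These figures follow directly from the definitions of H(S), L and η above. As an illustration, the following sketch is not part of the original report (the function names are illustrative): it forms the nth extension of a memoryless source, builds a binary Huffman code for it with a heap, and reports the efficiency η = n·H(S)/L_n, reproducing Table 1 for the source {0.85, 0.15}.

```python
# Sketch: efficiency of a binary Huffman code for the n-th extension of a
# memoryless source. Assumes Python 3.8+ (for math.prod).
import heapq
import itertools
import math

def huffman_average_length(probs):
    """Average codeword length of a binary Huffman code for `probs`."""
    heap = [(p, i) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    ids = itertools.count(len(probs))      # unique tie-breakers for merged nodes
    avg_len = 0.0
    while len(heap) > 1:
        p1, _ = heapq.heappop(heap)
        p2, _ = heapq.heappop(heap)
        avg_len += p1 + p2                 # each merge adds one bit to every leaf below it
        heapq.heappush(heap, (p1 + p2, next(ids)))
    return avg_len

def extension_efficiency(source, n):
    """Efficiency eta = n*H(S) / L_n of the Huffman code for the n-th extension."""
    entropy = -sum(p * math.log2(p) for p in source)
    extension = [math.prod(combo) for combo in itertools.product(source, repeat=n)]
    return n * entropy / huffman_average_length(extension)

if __name__ == "__main__":
    for n in range(1, 6):
        print(n, round(extension_efficiency([0.85, 0.15], n), 4))
    # Should reproduce Table 1: 0.6098, 0.8544, 0.9663, 0.9837, 0.9926
```

The average length is accumulated as the sum of merged probabilities, which equals Σ Pi·Li for any Huffman tree, so ties in the merging order do not affect the reported efficiency.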
Some Irregular Cases

Although Shannon's theorem places bounds on the efficiency of a reasonable compact code, it says very little about the behaviour of an actual code. With slight changes to the source probabilities of Table 1, the Huffman code efficiencies for the binary source P = {0.8, 0.2} are as shown in Table 2. For the first three extensions the efficiency improves as we would expect, but at the fourth extension it deteriorates markedly. The fifth extension is little better than the fourth, and both are markedly worse than the third. Even so, the code efficiency is still well within the theoretical bounds. Cases such as this, where the efficiency decreases with increasing extension, will be termed "irregular", and the contexts in which they occur "irregularities". An "irregular extension" is one which gives a poorer efficiency than its immediate predecessor.

Extension    Efficiency
    1          0.7219
    2          0.9255
    3          0.9917
    4          0.9745
    5          0.9783

Table 2. Efficiencies for extensions of the source {0.80, 0.20}

To find other irregularities, a search was made of Huffman codes (binary codes of binary sources) for probabilities {p, 1–p}, with p varying from 0.05 to 0.50 in steps of 0.05, and for all extensions up to the 7th. The results are summarised in Table 3, with the efficiencies of the irregular extensions marked by an asterisk. The irregularities all occur over ranges of extensions and probabilities, but all areas are included in the table. We also see that even the example given first is poorly behaved as soon as we look beyond the range shown in Table 1! In general, we see that if an extension results in a particularly good code, it may be counter-productive to attempt to use a higher extension.

  p      Ext 1    Ext 2    Ext 3     Ext 4     Ext 5    Ext 6     Ext 7
0.05    0.2864   0.4992   0.6610    0.7800    0.8519   0.9056    0.9429
0.10    0.4690   0.7271   0.8805    0.9522    0.9767   0.9975    0.9887*
0.15    0.6098   0.8544   0.9663    0.9837    0.9926   0.9817*   0.9914
0.20    0.7219   0.9255   0.9917    0.9745*   0.9783   0.9954    0.9866*
0.25    0.8113   0.9615   0.9859    0.9913    0.9921   0.9910*   0.9932
0.30    0.8813   0.9738   0.9699*   0.9882    0.9913   0.9922    0.9964
0.35    0.9341   0.9692   0.9840    0.9836*   0.9963   0.9896*   0.9963
0.40    0.9710   0.9710   0.9894    0.9896    0.9931   0.9945    0.9955
0.45    0.9928   0.9928   0.9928    0.9929    0.9946   0.9951    0.9959

Table 3. Efficiencies for extensions of a range of binary codes (irregular extensions marked *)

Analysis of the behaviour

The previous section shows that the efficiency sometimes decreases with higher extensions. The reason for this behaviour lies in the generation of the Huffman code and the variation of that coding as the symbol probabilities vary. The generated Huffman code, the shape of its associated tree and the distribution of codeword lengths are all critically dependent on the actual symbol probabilities. Each tree is optimum at one set of probabilities; as we move away from those values the coding will deteriorate. In many cases the coding with another tree will be improving until, when the two are equal, the code will "flip" to the other tree and its pattern of codeword lengths. We will thus get a discontinuity in the graph of efficiency against symbol probability as one tree takes over from the other.
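This flip can be observed directly by recording the multiset of codeword lengths alongside the efficiency while sweeping the symbol probability. The sketch below is not part of the original report; the helper names, the choice of the 4th extension and the probability grid are illustrative assumptions. Where the printed length profile changes between two adjacent values of p, one Huffman tree has taken over from another.

```python
# Sketch: watch the Huffman tree of the 4th extension of {p, 1-p} change ("flip")
# as p varies, by tracking the multiset of codeword lengths and the efficiency.
import heapq
import itertools
import math
from collections import Counter

def huffman_lengths(probs):
    """Codeword lengths of a binary Huffman code for `probs`."""
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    ids = itertools.count(len(probs))
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, leaves1 = heapq.heappop(heap)
        p2, _, leaves2 = heapq.heappop(heap)
        for leaf in leaves1 + leaves2:          # merged leaves sink one level deeper
            lengths[leaf] += 1
        heapq.heappush(heap, (p1 + p2, next(ids), leaves1 + leaves2))
    return lengths

def extension_profile(p, n):
    """Efficiency and codeword-length profile of the n-th extension of {p, 1-p}."""
    source = [p, 1.0 - p]
    ext = [math.prod(combo) for combo in itertools.product(source, repeat=n)]
    lengths = huffman_lengths(ext)
    avg_len = sum(pi * li for pi, li in zip(ext, lengths))
    entropy = -sum(q * math.log2(q) for q in source)
    profile = tuple(sorted(Counter(lengths).items()))   # (length, count) pairs
    return n * entropy / avg_len, profile

if __name__ == "__main__":
    for step in range(2, 10):                   # p = 0.10, 0.15, ..., 0.45
        p = step * 0.05
        eff, profile = extension_profile(p, 4)
        print(f"p = {p:.2f}  efficiency = {eff:.4f}  lengths = {profile}")
```

Refining the grid (for example, steps of 0.01) locates the changeover points more precisely and shows that each pattern of codeword lengths is optimal only over a limited range of p.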
Similar resources
Variable-to-fixed length codes are better than fixed-to-variable length codes for Markov sources
It is demonstrated that for finite-alphabet, Kth-order ergodic Markov sources (i.e., memory of K letters), a variable-to-fixed code is better than the best fixed-to-variable code (Huffman code). It is shown how to construct a variable-to-fixed length code for a Kth-order ergodic Markov source which compresses more effectively than the best fixed-to-variable code (Huffman code).
The Optimal Fix-Free Code for Anti-Uniform Sources
An n-symbol source which has a Huffman code with codelength vector Ln = (1, 2, 3, · · · , n − 2, n − 1, n − 1) is called an anti-uniform source. In this paper, it is shown that for this class of sources, the optimal fix-free code and symmetric fix-free code is C∗n = (0, 11, 101, 1001, · · · , 1 0⋯0 1), where the final codeword consists of a 1, followed by n − 2 zeros, followed by a 1.
Minimum Delay Huffman Code in Backward Decoding Procedure
For some applications where the speed of decoding and fault tolerance are important, such as video storage, one successful answer is fix-free codes. These codes have been applied in standards such as H.263+ and MPEG-4. The cost of using fix-free codes is an increase in the redundancy of the code, which means an increase in the number of bits needed to represent any piece of informat...
Compression of time-varying information using the Huffman code
Abstract: In this paper, we fit a function to the probability density curve representing an information stream, using an artificial neural network. The result of this methodology is a specific function that represents a memorizable probability density curve. We then use the resulting function for information compression with the Huffman algorithm. The difference between the proposed method and the general me...
Bit-Based Joint Source-Channel Decoding of Huffman Encoded Markov Multiple Sources
Multimedia transmission over time-varying channels such as wireless channels has recently motivated the research on the joint source-channel technique. In this paper, we present a method for joint source-channel soft decision decoding of Huffman encoded multiple sources. By exploiting the a priori bit probabilities in multiple sources, the decoding performance is greatly improved. Compared with...
Journal: IEEE Trans. Communications
Volume: 43, Issue: -
Pages: -
Publication date: 1995